Add learning rate scheduling support for DeepSpeedStrategy
#20320
base: master
Conversation
Thanks for the contribution @amorehead! Let's get to a green CI and take it from there.
Hey @amorehead, looks like the CI failures are legit; let me know if you can fix those.
Thank you @amorehead! I added a few comments. Essentially we need to turn this into a non-breaking change.
Also a small update to docs is needed.
        Currently, only a single optimizer is supported.
        self, module: Module, optimizers: list[Optimizer], scheduler: Optional[_LRScheduler] = None
    ) -> tuple["DeepSpeedEngine", list[Optimizer], Optional[_LRScheduler]]:
This will return None; we need to return Any here so we can ignore the scheduler if it is not provided as input.
Thanks! Addressed in this commit.
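To make the suggestion concrete, here is a hedged sketch of the kind of annotation change being asked for; the class and method names are simplified from the diff context above and the body is elided, so treat it as illustrative rather than the merged code.

from typing import Any, Optional

from torch.nn import Module
from torch.optim import Optimizer
from torch.optim.lr_scheduler import _LRScheduler


class DeepSpeedStrategy:  # simplified stand-in for the real strategy class
    def _setup_module_and_optimizers(
        self,
        module: Module,
        optimizers: list[Optimizer],
        scheduler: Optional[_LRScheduler] = None,
    ) -> tuple["DeepSpeedEngine", list[Optimizer], Any]:
        # Annotating the last tuple slot as Any (rather than Optional[_LRScheduler])
        # lets callers ignore the scheduler entry when none was provided.
        ...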
src/lightning/fabric/fabric.py
@@ -266,7 +269,7 @@ def setup(

         if optimizers:
             # join both types in a tuple for API convenience
-            return (module, *optimizers)
+            return (module, *optimizers, scheduler)
This is a breaking change: it will cause existing user code to fail, because scheduler is returned unconditionally. Since scheduler is Optional in the signature, I suggest we only return it if it was not passed as None, so we won't break anyone's code.
Agreed. Addressed in this commit.
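For reference, the backwards-compatible return agreed on above could look roughly like this; _pack_setup_result is a hypothetical helper used only to illustrate the return shape, not code from the PR.

from typing import Any, Optional


def _pack_setup_result(module: Any, optimizers: list[Any], scheduler: Optional[Any] = None) -> Any:
    """Hypothetical helper mirroring the non-breaking return discussed above."""
    if optimizers:
        # Join module and optimizers in a tuple for API convenience, and append
        # the scheduler only when the caller actually passed one, so existing
        # `model, optimizer = fabric.setup(model, optimizer)` unpacking keeps working.
        if scheduler is not None:
            return (module, *optimizers, scheduler)
        return (module, *optimizers)
    return module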
-        optimizer: Optional[Optimizer] = None,
-    ) -> tuple["DeepSpeedEngine", Optimizer]:
+        self, model: Module, optimizer: Optional[Optimizer] = None, scheduler: Optional[_LRScheduler] = None
+    ) -> tuple["DeepSpeedEngine", Optimizer, Optional[_LRScheduler]]:
Same comment as above
Addressed in this commit.
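For background on where the scheduler ends up: DeepSpeed's own initialize call accepts an lr_scheduler argument and hands it back alongside the engine and optimizer, which is presumably what the extended signature threads through. A minimal sketch follows; it only runs inside a DeepSpeed-launched distributed job, and the config values are placeholders.

import deepspeed
import torch

model = torch.nn.Linear(8, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

# deepspeed.initialize returns (engine, optimizer, dataloader, lr_scheduler);
# the client scheduler passed in comes back (possibly wrapped) in the last slot.
engine, optimizer, _, scheduler = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    lr_scheduler=scheduler,
    config={"train_batch_size": 8},
)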
@@ -104,7 +104,10 @@ def pl_worker_init_function(worker_id: int, rank: Optional[int] = None) -> None:
     if _NUMPY_AVAILABLE:
         import numpy as np

-        np.random.seed(seed_sequence[3] & 0xFFFFFFFF)  # numpy takes 32-bit seed only
+        ss = np.random.SeedSequence([base_seed, worker_id, global_rank])
This is an unrelated change; it shouldn't be included.
Since this is now merged into master per this previous pull request, is this comment still relevant?
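As a side note on that hunk: the reworked worker seeding mixes the base seed, worker id, and global rank through a numpy SeedSequence instead of truncating a single value to 32 bits. A rough sketch of the pattern, with example values standing in for the function's real arguments:

import numpy as np

base_seed, worker_id, global_rank = 1234, 0, 0  # illustrative values

# SeedSequence mixes all three inputs into well-spread entropy, so every
# (worker, rank) pair gets an independent stream.
ss = np.random.SeedSequence([base_seed, worker_id, global_rank])

# The legacy global RNG accepts an array of 32-bit words as its seed.
np.random.seed(ss.generate_state(4))

# New-style generators can be derived from the same sequence as well.
rng = np.random.default_rng(ss)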
@amorehead I'm wrapping up the last few PRs for the release. Do you have time to fix this one in the next couple of days?
@lantiga, apologies, I'm just now getting to fixing this pull request up. I've updated the …
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files

@@            Coverage Diff            @@
##           master   #20320      +/-  ##
==========================================
+ Coverage      41%      87%     +46%
==========================================
  Files         265      268       +3
  Lines       23394    23455      +61
==========================================
+ Hits         9476    20395   +10919
+ Misses      13918     3060   -10858
@amorehead mind checking the last failing case:
@Borda, I've just fixed this test.
Seems one is left:
@Borda, let's see if this latest commit of mine fixes it.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://lightning.ai/docs/pytorch/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Discord. Thank you for your contributions.
@Borda, may I ask you to check the "Read the Docs" tests and why they are failing?
What does this PR do?
Adds learning rate scheduling support for DeepSpeedStrategy.
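With this in place, the intended Fabric-level usage would look roughly like the sketch below; the exact setup() signature is an assumption based on the diffs above, so treat it as illustrative.

import torch
from lightning.fabric import Fabric

fabric = Fabric(accelerator="cuda", devices=1, strategy="deepspeed")
fabric.launch()

model = torch.nn.Linear(32, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# With the proposed change, setup() can also take the scheduler and return it
# (only when one is passed, per the non-breaking behaviour discussed above).
model, optimizer, scheduler = fabric.setup(model, optimizer, scheduler=scheduler)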
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:
Reviewer checklist
📚 Documentation preview 📚: https://pytorch-lightning--20320.org.readthedocs.build/en/20320/